ABSTRACT

This paper analyses the effects of random noise in determining errors and 
confidence levels for galaxy redshifts obtained by cross-correlation tech- 
niques. The main finding is that confidence levels have previously been 
overestimated, and errors inaccurately calculated in certain applications. 
New formulae are presented. 

KEYWORDS: Methods: data analysis, statistical. Galaxies: distances & 
redshifts. 



1 INTRODUCTION 



It is becoming increasingly possible to obtain very large
numbers of galaxy spectra through the use of systems
such as those involving optical fibres. With such large
datasets, automated procedures for obtaining redshifts
become desirable and necessary for efficiency. It is
important to have automatic and objective measures of the
confidence levels and errors in the results, so that suspect
redshifts may be investigated individually by manual
inspection of the data. Clearly, it is important that the
initial automatic quality assessment is founded on a firm
theoretical basis.

Obtaining redshifts from galaxy spectra is commonly
effected by cross-correlating the spectrum with the
spectrum of a 'template' galaxy of known redshift. The
relative shift of the two spectra when the cross-correlation
function (ccf) has its highest peak is then used to estimate
the redshift of the galaxy. The procedure for doing this
was detailed and analysed by Tonry & Davis (1979;
hereafter TD). In the case of identical galaxy and template
spectra the technique is excellent, but in practice there
are differences, either intrinsic spectral differences or the
presence of noise. These differences can lead to an error
in the derived redshifts, or to an entirely spurious redshift
being picked up. This paper puts the analysis of noise in
ccf techniques on a firm footing, and presents improved
formulae for calculating confidence levels and errors.

2 METHOD

The method of obtaining redshifts from ccf techniques
is presented in TD. One bins the galaxy spectrum
$g_n$ and template $t_n$ into $N$ bins ($n = 0, \ldots, N-1$)
logarithmically spaced in wavelength (initially we follow
almost the same notation as TD: $g(n)$ in TD is written
here as $g_n$). The spectra are continuum-subtracted and
may be filtered to remove long-wavelength components
(e.g. from incomplete continuum subtraction) and/or
short-wavelength contributions on sub-resolution scales.
A cross-correlation function is then made from the galaxy
and template spectra, and the highest peak in the ccf is
used to calculate the relative shift between the two. This
is then used to deduce the galaxy redshift, assuming the
template redshift is known (see TD for more details).

Since the ccf technique is most conveniently implemented
via Fourier methods, the spectra are operationally
assumed to be periodic, so are apodised to remove
spurious short-wavelength components coming from
a mismatch at the two ends of the spectrum.

Let the discrete Fourier transforms (FTs) of $g_n$ and
$t_n$ be $G_k$ and $T_k$, where, for example

$$G_k = \sum_n g_n \, e^{-2\pi i n k/N}, \qquad k = 0, \ldots, N-1. \tag{1}$$

With this notation, the inverse transform is
$g_n = (1/N)\sum_k G_k \, e^{2\pi i n k/N}$
and the variance of $g$ is
$\sigma_g^2 = (1/N)\sum_n g_n^2 = (1/N^2)\sum_k |G_k|^2$,
and all sums run from $0$ to $N-1$ unless otherwise stated.
The normalised ccf is formed:

$$c_n = (g \times t)_n = \frac{1}{N\sigma_g\sigma_t}\sum_m g_m \, t_{m-n}. \tag{2}$$

We assume that $g_n$ is identical to $t_n$ except that it is:

1. different by addition of a random noise component $\epsilon_n$

2. multiplied by $\alpha$

3. shifted by $\delta$ (in bins)

4. broadened by convolution with a symmetric function $b_n$,

i.e.

$$g_n = \alpha\left[(t * b)_{n-\delta} + \epsilon_n\right] \tag{3}$$

where the $*$ indicates convolution. Specifically:

$$(t * b)_n = \sum_m t_m \, b_{n-m}. \tag{4}$$

Transforming,

$$G_k = \alpha\left[T_k B_k \, e^{-2\pi i k\delta/N} + E_k\right] \tag{5}$$

where $B_k$ and $E_k$ are the Fourier transforms of the
broadening function $b_n$ and the noise $\epsilon_n$. The
normalised ccf is

$$c_n = \frac{\alpha}{N\sigma_g\sigma_t}\sum_m \left[(t * b)_{m-\delta} + \epsilon_m\right] t_{m-n}, \tag{6}$$

or $c_n = \alpha\{[(t * b) \times t]_{n-\delta} + [\epsilon \times t]_n\}$.
In Fourier space, the coefficients of $c$ are (see e.g.
Nussbaumer 1981, p. 80-83):

$$C_k = \lambda\alpha\left(|T_k|^2 B_k \, e^{-2\pi i k\delta/N} + T_k^* E_k\right) \tag{7}$$

and $\lambda = 1/(N\sigma_g\sigma_t)$. Fig. 1 shows an example galaxy
spectrum G0102-63 and a template radial velocity standard
HD136202, with a power series continuum subtraction
applied to both. Filtering has also been applied
in Fourier space to remove very long-wavelength
components from the spectra. Also shown is the ccf, illustrating
the power of the cross-correlation technique.
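
As an illustration of the Fourier implementation described in this section, the following is a minimal sketch in Python with NumPy, not the code used for the results in this paper. It assumes the spectra are already binned logarithmically in wavelength and continuum-subtracted; apodisation and filtering are omitted. The function name `normalised_ccf` and the synthetic white-noise "template" are hypothetical, chosen only for the demonstration.

```python
import numpy as np

def normalised_ccf(g, t):
    """Periodic normalised ccf: c_n = sum_m g_m t_{m-n} / (N sigma_g sigma_t)."""
    g = np.asarray(g, dtype=float)
    t = np.asarray(t, dtype=float)
    g = g - g.mean()              # assumption: remove any residual mean level
    t = t - t.mean()
    n_bins = g.size
    G = np.fft.fft(g)
    T = np.fft.fft(t)
    # The ccf has Fourier coefficients proportional to G_k T_k^*;
    # ifft supplies the 1/N of the inverse transform.
    c = np.fft.ifft(G * np.conj(T)).real
    return c / (n_bins * g.std() * t.std())

# Hypothetical test data: a white-noise "template", shifted by 7 bins,
# scaled by alpha = 2 and contaminated with gaussian noise.
rng = np.random.default_rng(0)
n_bins = 512
template = rng.standard_normal(n_bins)
true_shift = 7
galaxy = 2.0 * (np.roll(template, true_shift)
                + 0.5 * rng.standard_normal(n_bins))

c = normalised_ccf(galaxy, template)
shift = int(np.argmax(c))         # highest ccf peak gives the shift in bins
print("recovered shift (bins):", shift)
```

Because the ccf is normalised by $N\sigma_g\sigma_t$, the scale factor $\alpha$ drops out, and a spectrum correlated with itself gives a peak of height 1 at zero shift.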

We now make the principal assumption for the following
analysis: we assume the noise and the template are
uncorrelated, and $\epsilon_n$ is a random gaussian field. Note
that the noise component may be a combination of photon
noise and real spectral differences between the two
objects. In this case, the field specified by $N_k = T_k^* E_k$ is
also a random gaussian field. In the absence of broadening,
$\sigma_g^2 = \alpha^2(\sigma_t^2 + \sigma_\epsilon^2)$.

The method of estimating the noise level proposed by
TD is valid for a symmetric broadening function $b_n$. In
this case, the FT of the first ('signal') term in (6) is
symmetric about $n = \delta$. Hence the asymmetric part of the
actual ccf may be used to estimate the contribution from
the noise. We form the symmetric and antisymmetric
parts:

$$s_n = \tfrac{1}{2}\left(c_{\delta+n} + c_{\delta-n}\right), \qquad a_n = \tfrac{1}{2}\left(c_{\delta+n} - c_{\delta-n}\right) \tag{8}$$

with transforms

$$S_k = \tfrac{1}{2}\left(C_k \, e^{2\pi i k\delta/N} + C_{-k} \, e^{-2\pi i k\delta/N}\right), \qquad A_k = \tfrac{1}{2}\left(C_k \, e^{2\pi i k\delta/N} - C_{-k} \, e^{-2\pi i k\delta/N}\right) \tag{9}$$

where $C_{-k} = C_k^*$ from the reality of $c_n$. With a symmetric
$b_n$, we have $B_k^* = B_k$, so

$$S_k = \lambda\alpha\left[|T_k|^2 B_k + (M_k + M_k^*)\right], \qquad A_k = \lambda\alpha\left(M_k - M_k^*\right) \tag{10}$$

where $M_k = \tfrac{1}{2} T_k^* E_k \, e^{2\pi i k\delta/N}$. Thus, as stated by TD,
the noise contribution to the symmetric ccf is statistically the
same as to the antisymmetric part, if the noise has random
phases, uncorrelated with the template. The r.m.s.
noise is therefore $\sqrt{2}\sigma_a$, where $\sigma_a$ is the measured r.m.s.
variation in the antisymmetric part of the ccf. An estimator
for the FT of the broadening function may be made
by making use of this property, assuming the noise and
template are uncorrelated:



$$\hat{B}_k^2 \propto \frac{|S_k|^2 - |A_k|^2}{|T_k|^4}. \tag{11}$$

The normalisation is not important here (a redefinition
amounts to altering $\alpha$). Note that the contributions to
the symmetric and antisymmetric powers are only equal
in the mean, so some averaging over modes may be
necessary in practice.
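
The noise estimate described above can be sketched as follows. This is an illustrative Python/NumPy fragment under stated assumptions, not the author's implementation: the ccf is treated as periodic, the peak position `delta` is taken as an integer bin, and the synthetic ccf (a gaussian peak plus white noise) and the function names are hypothetical.

```python
import numpy as np

def split_about_peak(c, delta):
    """Symmetric and antisymmetric parts of a periodic ccf about integer bin delta."""
    c = np.asarray(c, dtype=float)
    shifted = np.roll(c, -delta)            # shifted[n]  = c[delta + n]
    mirrored = np.roll(shifted[::-1], 1)    # mirrored[n] = c[delta - n]
    s = 0.5 * (shifted + mirrored)
    a = 0.5 * (shifted - mirrored)
    return s, a

def peak_height_in_noise_units(c, delta):
    """nu = c_delta / (sqrt(2) * sigma_a), with sigma_a the r.m.s. of the antisymmetric part."""
    _, a = split_about_peak(c, delta)
    sigma_a = np.sqrt(np.mean(a ** 2))
    return c[delta] / (np.sqrt(2.0) * sigma_a)

# Hypothetical ccf: symmetric gaussian peak at bin 40 plus a little white noise.
rng = np.random.default_rng(1)
n_bins, delta = 256, 40
idx = np.arange(n_bins)
dist = (idx - delta + n_bins // 2) % n_bins - n_bins // 2   # circular distance from the peak
c = 0.8 * np.exp(-0.5 * (dist / 3.0) ** 2) + 0.02 * rng.standard_normal(n_bins)

s, a = split_about_peak(c, delta)
nu = peak_height_in_noise_units(c, delta)
print("peak height in units of the r.m.s. noise:", nu)
```

The symmetric signal cancels exactly in `a`, so only the noise contributes to `sigma_a`; the factor $\sqrt{2}$ restores the part of the noise that went into the symmetric half.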

3 CONFIDENCE ANALYSIS 

We want the probability that a peak in the ccf has been
caused by noise. It is here that we may improve substantially
on the analysis of TD, using the theory of peaks in
gaussian noise originally developed by Rice (1954).

Let the range of the ccf be $L$ bins, and let the believable
fraction of this be $f$ (one might wish, for example, to
exclude negative redshifts). With the Fourier technique,
$L = N$. The expected number of noise peaks, in a range
$fL$, higher than the observed peak $c_\delta$ is

$$\mathcal{N} = n_{\rm pks}(>\nu) \, f L \tag{12}$$

where $\nu = c_\delta/(\sqrt{2}\sigma_a)$ is the height of the ccf peak in units
of the r.m.s. noise. $n_{\rm pks}(>\nu)$ is the number density of
peaks above $\nu$, and is given by (Williams et al. 1991):

$$n_{\rm pks}(>\nu) = \frac{1}{4\pi R}\left\{{\rm erfc}\left[\frac{\nu}{\sqrt{2(1-\gamma^2)}}\right] + \gamma \, e^{-\nu^2/2}\left[1 + {\rm erf}\left(\frac{\gamma\nu}{\sqrt{2(1-\gamma^2)}}\right)\right]\right\}. \tag{13}$$

Note that $\nu$ here is identical to $r$ in TD. The quantities
$R = \sigma_1/\sigma_2$ and $\gamma = \sigma_1^2/(\sigma_0\sigma_2)$ characterise the noise in
terms of moments of its power spectrum:

$$\sigma_j^2 = \frac{2}{N^2}\sum_{k=1}^{N/2}\left(\frac{2\pi k}{N}\right)^{2j}|N_k|^2. \tag{14}$$

Note that the sum extends over half the space. Reality
of the noise ensures each half contributes equally. $R$ is a
length which characterises the coherence properties of the
noise, and $\gamma$ is a dimensionless parameter which measures
the relative contribution from short and long wavelengths.

To obtain the confidence level $C$ for the observed peak
in the ccf, we need the probability that there are no higher
noise peaks in the interval. This is a non-trivial task, and
the theory of peaks cannot currently provide an answer.
However, a good approximation for high peaks is to assume
that the peaks are uncorrelated (Williams et al.
1991), in which case

$$C = \exp(-\mathcal{N}). \tag{15}$$

This should be very accurate if $\mathcal{N} \ll 1$. If not, the
confidence is low, and the redshift obtained will be suspect
anyway.
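
The confidence calculation of this section can be sketched numerically as below. This is a hedged illustration rather than the program used for the results quoted here. Two assumptions are made explicit: the cumulative peak density is taken to be the standard one-dimensional gaussian-field (Rice-type) result, $n_{\rm pks}(>\nu) = \{{\rm erfc}(x) + \gamma e^{-\nu^2/2}[1 + {\rm erf}(\gamma x)]\}/(4\pi R)$ with $x = \nu/\sqrt{2(1-\gamma^2)}$, and uncorrelated peaks are treated as Poisson, so that $C = \exp(-\mathcal{N})$. All function names are hypothetical, and the normalisation of the spectral moments cancels in $R$ and $\gamma$.

```python
import math
import numpy as np

def spectral_parameters(noise):
    """R = sigma_1/sigma_2 and gamma = sigma_1^2/(sigma_0 sigma_2) from a noise sample."""
    noise = np.asarray(noise, dtype=float)
    n_bins = noise.size
    power = np.abs(np.fft.rfft(noise)) ** 2           # half of Fourier space only
    k = 2.0 * np.pi * np.arange(power.size) / n_bins  # angular wavenumber per bin
    m0 = np.sum(power[1:])                            # proportional to sigma_0^2
    m1 = np.sum(k[1:] ** 2 * power[1:])               # proportional to sigma_1^2
    m2 = np.sum(k[1:] ** 4 * power[1:])               # proportional to sigma_2^2
    return math.sqrt(m1 / m2), m1 / math.sqrt(m0 * m2)

def peak_density_above(nu, R, gamma):
    """Assumed Rice-type number density (per bin) of noise peaks higher than nu."""
    x = nu / math.sqrt(2.0 * (1.0 - gamma ** 2))
    bracket = math.erfc(x) + gamma * math.exp(-0.5 * nu ** 2) * (1.0 + math.erf(gamma * x))
    return bracket / (4.0 * math.pi * R)

def confidence(nu, R, gamma, f, L):
    """C = exp(-expected number of higher noise peaks in the range f*L)."""
    return math.exp(-peak_density_above(nu, R, gamma) * f * L)

# Hypothetical noise sample standing in for the antisymmetric part of a ccf.
rng = np.random.default_rng(2)
noise = rng.standard_normal(1024)
R, gamma = spectral_parameters(noise)
for nu in (2.5, 4.0, 6.0):
    print(nu, confidence(nu, R, gamma, f=0.5, L=noise.size))
```

At $\nu = 0$ the assumed density reduces to $(1+\gamma)/(4\pi R)$, and for $\nu \to -\infty$ to $1/(2\pi R)$, the total density of peaks.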

To obtain the confidence level, we need to estimate
the parameters $R$ and $\gamma$ from the noise. This may be
done either by performing an FT on the antisymmetric
part of the ccf, and using (14), or by making use of the
following results, which follow from (13):

$$n_{\rm pks} = \frac{1}{2\pi R}, \qquad n_{\rm pks}(>0) = \frac{1+\gamma}{4\pi R}. \tag{16}$$

The total number of peaks $N_{\rm pks}$, and the number of
positive peaks $N_{\rm pks}(>0)$ may therefore be used more
straightforwardly to estimate $R$ and $\gamma$:

$$\gamma = \frac{2N_{\rm pks}(>0)}{N_{\rm pks}} - 1; \qquad R = \frac{L}{2\pi N_{\rm pks}}. \tag{17}$$

The two methods give, in tests involving nearly a thousand
ccfs, the same answers to an r.m.s. accuracy of 3%
for $R$. The accuracy in $\gamma$ is not so high, 7% to 24%
depending on whether the spectra are filtered or not. The
Fourier method is to be preferred.

For high peaks ($\nu \gg 1$), one may make use of the
asymptotic expansion ${\rm erfc}(z) \simeq e^{-z^2}/(\sqrt{\pi}z)$ to get an
answer good to 3% for $\nu > 2$ and $\gamma > 0.5$:

$$C \simeq 1 - 2f\left[N_{\rm pks}(>0) - \tfrac{1}{2}N_{\rm pks}\right]e^{-\nu^2/2} \qquad (\nu \gg 1). \tag{18}$$

This may be compared with TD's high-peak limit
$C_{\rm TD} \simeq 1 - \sqrt{2/\pi}\, f N_{\rm pks}(>0)\, e^{-\nu^2/2}/\nu$. TD's analysis thus
overestimates the confidence by an amount

$$C_{\rm TD} - C \simeq f\left[2N_{\rm pks}(>0) - N_{\rm pks} - \sqrt{\frac{2}{\pi}}\frac{N_{\rm pks}(>0)}{\nu}\right]e^{-\nu^2/2} \qquad (\nu \gg 1). \tag{19}$$

These confidence estimates have been compared with
TD's for a sample of around 80 galaxies with redshifts
< 0.05, each correlated with 8 radial velocity standard
stars and 2 nearby galaxies. The signal-to-noise of the
ccfs is generally high, with only one with $\nu < 2.5$ and most
with $\nu > 5$. For this test sample, the Tonry and Davis
method systematically overestimates the confidence level
(see Fig. 2). As an indicator, a 95% confidence level
under the previous analysis corresponds roughly to a true
confidence of 84%.

4 ERRORS

If the correct peak has been selected, the redshift may
still have an error as the noise can move the position of
the maximum. The function

$$c_n = \frac{\lambda\alpha}{N}\sum_k e^{2\pi i k n/N}\left(|T_k|^2 B_k \, e^{-2\pi i k\delta/N} + T_k^* E_k\right) \tag{20}$$

is maximised. The first term ($= c$) is symmetric, peaking
at $n = \delta$. Expanding this term in a Taylor expansion, and
writing the second term as $e$, the peak is shifted to
$n - \delta = -e'/c''$, the primes indicating derivatives. Thus, for an
ensemble of ccfs with the same template but different
noise realisations, the maxima are shifted by an r.m.s.
amount

$$\Delta x = \sqrt{2}\,\sigma_{1a}/(-c''), \tag{21}$$

where we have used the antisymmetric part of the ccf to
write $\langle e'^2 \rangle = 2\sigma_{1a}^2$, with $\sigma_{1a}$ the $j = 1$ moment of the
antisymmetric part. The second derivative at the peak may
be written

$$c'' = -\frac{2\lambda\alpha}{N}\sum_{k=1}^{N/2}\left(\frac{2\pi k}{N}\right)^2 |T_k|^2 B_k. \tag{22}$$

To make further progress requires specific assumptions
about the broadening function. Broadening by convolution
with a gaussian is considered in Section 5, which
demonstrates that the error is rather insensitive to
broadening. Here therefore I consider no broadening ($B_k = 1$),
in which case

$$c'' = -\frac{\alpha\,\sigma_{t'}^2}{\sigma_g\sigma_t} \tag{23}$$

where $\sigma_{t'}^2 = (2/N^2)\sum_{k=1}^{N/2}(2\pi k/N)^2|T_k|^2$ is the second
moment of the filtered template spectrum. Calculation of
$\sigma_g$ requires $\sigma_\epsilon$, which may be calculated by noting that
$\lambda\alpha|E_k| = \sqrt{2}|A_k|/|T_k|$, from which we get

$$\sigma_\epsilon^2 = \frac{1}{N^2}\sum_k |E_k|^2 = \frac{2\sigma_g^2\sigma_t^2}{\alpha^2}\sum_k \frac{|A_k|^2}{|T_k|^2}. \tag{24}$$

So, using $\sigma_g^2 = \alpha^2(\sigma_t^2 + \sigma_\epsilon^2)$, we obtain the final
expression for the r.m.s. error:

$$\Delta x = \sqrt{2}\,\frac{\sigma_{1a}\sigma_t^2}{\sigma_{t'}^2}\left[1 - 2\sigma_t^2\sum_k \frac{|A_k|^2}{|T_k|^2}\right]^{-1/2}. \tag{25}$$

This can be tested in two ways. The first test is to do 
exactly what this analysis assumes: a template spectrum 
has noise added, and is shifted. The ccf technique is then 
used to estimate the shift. Fig. 3a shows the distribution 
of errors, normalised to the error estimate (25), along 
with the expected gaussian of unit variance, for white 
noise. Slight deviations are expected due to the discrete 
sampling of the spectra, but the agreement is remarkably 
good. Differences may arise because, although the peak
of the ccf is calculated to sub-pixel accuracy, the
antisymmetric part is calculated assuming $\delta$ is an integer.
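
The first test can be mimicked with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the template is a synthetic smooth random spectrum (hypothetical, not one of the stellar templates), the shifts are integer numbers of bins, the noise is white and gaussian, and the sub-pixel peak position is found by a three-point parabolic fit. It reports the empirical r.m.s. shift error and does not evaluate the error estimate (25).

```python
import numpy as np

def ccf(g, t):
    """Periodic normalised cross-correlation via FFT."""
    g = g - g.mean()
    t = t - t.mean()
    c = np.fft.ifft(np.fft.fft(g) * np.conj(np.fft.fft(t))).real
    return c / (g.size * g.std() * t.std())

def subpixel_peak(c):
    """Position of the highest ccf peak, refined by a parabola through 3 points."""
    n_bins = c.size
    i = int(np.argmax(c))
    ym, y0, yp = c[(i - 1) % n_bins], c[i], c[(i + 1) % n_bins]
    denom = ym - 2.0 * y0 + yp
    offset = 0.5 * (ym - yp) / denom if denom != 0 else 0.0
    return i + offset

rng = np.random.default_rng(3)
n_bins = 1024

# Hypothetical smooth template: white noise low-pass filtered in Fourier space.
freq = np.fft.rfftfreq(n_bins)                        # cycles per bin
spectrum = np.fft.rfft(rng.standard_normal(n_bins)) * np.exp(-0.5 * (freq / 0.1) ** 2)
spectrum[0] = 0.0                                     # remove the mean level
template = np.fft.irfft(spectrum, n_bins)
template /= template.std()

errors = []
for _ in range(200):
    true_shift = int(rng.integers(5, 200))
    galaxy = np.roll(template, true_shift) + 0.3 * rng.standard_normal(n_bins)
    errors.append(subpixel_peak(ccf(galaxy, template)) - true_shift)

errors = np.array(errors)
rms = float(np.sqrt(np.mean(errors ** 2)))
print("r.m.s. shift error (bins):", rms)
```

Dividing each simulated error by the corresponding estimate from (25) and comparing the histogram with a unit gaussian would reproduce the kind of check shown in Fig. 3a.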

A further weak test of the method is shown in Fig. 
3b, which is based on real data; if the differences be- 
tween galaxy spectra and templates can be characterised 
in the way suggested in this paper, the errors should be 
distributed normally. The figure shows the distribution 
of errors, each normalised to the r.m.s. (for each galaxy 
cross-correlated to the ten templates), and the ideal case
of a gaussian of unit variance. Cases where the wrong
ccf peak has been selected are almost always easy to spot
and these have been removed. Note that it is incorrect
to use the standard error for the mean of the $n$ redshifts
(r.m.s./$\sqrt{n}$ if all errors are the same), as the noise in the
galaxy spectrum may shift the derived redshift
systematically for all templates.

4.1 Filtering 

The results shown so far have used a FIGARO rou- 
tine SCROSS, which has been generalised to include er- 
rors (available on some STARLINK sites as XCORR). 
SCROSS filters out low-frequency components after con- 
tinuum subtraction, whereas XCORR offers the user the 
option not to do this. In XCORR, the errors are estimated
from the actual full-width at half-maximum $W$ of
the peak in the ccf: error = $0.2833\,W/(1 + \nu)$. A similar
routine, XCSAO, written for IRAF (Kurtz et al. 1992)
uses the formula $0.375\,W/(1 + \nu)$. These work reasonably
well in practice, but there can be cases where noise
broadens the wings of the peak substantially, while the 
peak remains locally sharp, as illustrated in Fig. 4. The 
error in such cases can be grossly overestimated. Fig. 1 
shows the filtered case, for the same spectra. Here the 
actual width is a better guide, but sometimes provides 
a poor error estimate, usually an overestimate. The for- 
mula (25) is based on local properties at the peak itself, 
so is insensitive to any broad wings. 

5 GAUSSIAN POWER SPECTRA 

In the ideal case of no noise and no broadening, the nor- 
malised ccf has a peak of 1. In practice, the height of the 
peak falls below this, and one ought to be able to use the 
actual height to estimate the significance of the redshift 
obtained. This is indeed the case, and one can get a rule- 
of-thumb estimate by making a specific assumption that 
the power spectra involved are gaussian. If the spectra 
are filtered, this approximation is usually quite good. 

We assume the broadening function is a gaussian, with
width $\sigma$, that the template and noise have gaussian power
spectra with characteristic widths $\tau$ and $\eta$, i.e.



Evaluating the sums in sections 2 and 3, we find 


The mean value of the second derivative obeys 



$\langle -c''_\delta \rangle =$



(31) 

and if we approximate the ccf peak as a gaussian,
$c_n \simeq c(\delta)\exp[-(n-\delta)^2/(2\mu^2)]$, matching second derivatives
at the peak gives



(33) 



in agreement with the maximum likelihood method of
TD. From this, one can obtain the velocity broadening in
the galaxy provided any additional filtering is accounted
for.

For gaussian power spectra, the error may be written 

To evaluate this, we note that the noise term in the ccf
has a variance

and we may use this to eliminate $\sigma_\epsilon/\sigma_t$ from (25). After
some algebra, (25) becomes

For high peaks this reduces to
$\Delta x =$

A result useful for the estimators below is

(26)
(27)

(28)

which demonstrates the insensitivity of the error to the

broadening. One can write the error in terms of the num- 
ber of peaks, with the hope that the result can be used 
generally. If one assumes 77 ~ r a and uses 

(38)

then the error may be written



(39) 



Unfortunately, in our sample, this turns out to be
a poor estimator, underestimating errors by a factor of
about 3.

6 CONCLUSIONS

This paper has analysed the effects of noise on the confi- 
dence levels and errors in the cross-correlation technique 
widely used to obtain galaxy redshifts. This improved 
treatment finds somewhat lower confidence levels than 
previously found, and also provides a formally more cor- 
rect error assessment for the redshifts obtained. The prin- 
cipal results are the confidence level (equation 15 with 
13, or the approximation, equation 18), and the error es- 
timate (equation 25). 

The results should be particularly useful in identify- 
ing questionable redshifts in programmes which obtain 
large numbers of galaxy spectra. In particular, a low 
confidence level would indicate that the object should 
be checked carefully before the redshift is accepted. It 
should be borne in mind that errors may well arise from 
factors which are not treated in this paper, principally 
wavelength calibration errors and spectral differences be- 
tween galaxies and templates which cannot be described 
by noise (even when filtered). 

It is hoped to provide a FIGARO version of the cross- 
correlation program SCROSS to provide error and confi- 
dence analysis. 

Figure Captions

Figure 1. Sample spectra of a galaxy and template
star, and their cross-correlation function. In the
cross-correlation process, a Fourier filter has been applied to
remove long-wavelength components.

Figure 2. Confidence assessments compared with the
method of Tonry and Davis for about 800 filtered ccfs,
based on around 80 galaxies with redshifts < 0.05, and
10 templates, 8 of which are radial velocity standard
stars and the other two nearby galaxies. The spectra
are mostly good signal-to-noise, with all ccf peak heights
except one above $\nu = 2.5$, and about 2/3 above $\nu = 5$.

Figure 3. The distribution of errors, normalised to the
sample standard deviation, obtained by (a) adding white noise to a
template and cross-correlating in the normal way. The
error plotted is divided by the program error estimate
(25) and the distribution should follow the solid curve;
(b) cross-correlating real galaxies with 10 templates. The
error estimate in this case is the r.m.s. deviation from the
mean of the templates. If the assumptions in the analysis
were strictly realised in practice, the distribution should
be normal (solid curve).

Figure 4. The ccf for the galaxy-template pair of Figure
1, but with no long-wavelength filtering applied. The
actual FWHM of the peak is a poor estimate of the error.



